In [1]:
# Part 1: Data cleaning
# imports packages
import pandas as pd
import numpy as np
from datetime import datetime,date,timedelta
In [2]:
# Part 2: Visualization and EDA
import matplotlib.pyplot as plt # basic plotting
import seaborn as sns # for prettier plots
import plotly
import plotly.graph_objects as go
import plotly.express as px
from statsmodels.tsa.stattools import adfuller
import statsmodels.api as sm
from matplotlib.dates import DateFormatter
import seaborn as sns
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from IPython.display import HTML
In [3]:
#Part 3: Model Selection
import itertools
from itertools import product
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pmdarima as pm
from fbprophet import Prophet
from fbprophet.plot import add_changepoints_to_plot
In [4]:
# Part 4: Second dataset
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import scipy.stats as scs
from covid19dh import covid19
from sklearn.linear_model import LinearRegression  
from scipy.stats import shapiro
import pingouin as pg
from pingouin import welch_anova
import statsmodels.stats.multicomp as mc

# settings
import warnings
warnings.filterwarnings("ignore")

# show plots inline for notebooks
%matplotlib inline

An Analysis of COVID-19 confirmed cases in New York: Is the movement restrictions important to prevent the COVID-19 pandemic?

Introduction

The goal of this project is to develop data science or machine learning models that forecast COVID-19 cases with the aim of helping policymakers plan for the days and months ahead, and take action to change the course of the pandemic for the better.

This project includes five parts:

  1. Data Cleaning
  2. Data Visualuzation and Exploratory Data Analysis
  3. Model Selection and Fitting to Data
  4. Impact of Movement Restriction on COVID-19 in New York
  5. Insights about policy and guidance
  6. Conclusion
  7. Reference

1. Data Cleaning

In this part, COVID-19 CSSE John Hopknins datasets will be cleaned firstly to prepare for time series modelling. The selected dataset is between 2020-03-03 to 2020-12-14.

1.1 Load Datasets

The time series data which contains an aggregation of each USA State level confirmed case data is retrived from the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.

In [5]:
#import dataset
US_case_ori = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv',
                      parse_dates=True)
In [6]:
US_case_ori.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3340 entries, 0 to 3339
Columns: 341 entries, UID to 12/16/20
dtypes: float64(3), int64(332), object(6)
memory usage: 8.7+ MB
In [7]:
US_case_ori.head() 
Out[7]:
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 12/7/20 12/8/20 12/9/20 12/10/20 12/11/20 12/12/20 12/13/20 12/14/20 12/15/20 12/16/20
0 84001001 US USA 840 1001.0 Autauga Alabama US 32.539527 -86.644082 ... 3043 3087 3117 3186 3233 3258 3300 3329 3426 3510
1 84001003 US USA 840 1003.0 Baldwin Alabama US 30.727750 -87.722071 ... 9821 9974 10087 10288 10489 10665 10806 10898 11061 11212
2 84001005 US USA 840 1005.0 Barbour Alabama US 31.868263 -85.387129 ... 1224 1240 1245 1258 1264 1269 1272 1275 1292 1296
3 84001007 US USA 840 1007.0 Bibb Alabama US 32.996421 -87.125115 ... 1299 1317 1322 1359 1398 1417 1441 1455 1504 1520
4 84001009 US USA 840 1009.0 Blount Alabama US 33.982109 -86.567906 ... 3324 3426 3496 3600 3663 3744 3776 3803 3881 3950

5 rows × 341 columns

Since the data updates daily, to keep the data reproducibility, I decide to select all the data before Dec-14-2020.

In [8]:
US_case= US_case_ori.iloc[:, 0:339]

1.2 Select the location

According to CNN, New York is the epicenter of US COVID spread at the beginning of the pandemic outbreak in March. To be specific, the coronavirus cases in New York are almost ten times more than in other states. Therefore, I decide to study New York’s covid situation.

Step 1: select the required data

In [9]:
#Group by Province
confirmed_state = US_case.groupby(['Province_State']).sum().drop(['Lat', 'Long_'], axis=1)

#reset index
confirmed_state=confirmed_state.reset_index()

#Select New York, and store the dataset into NY_case
NY_case = confirmed_state[confirmed_state["Province_State"]=="New York"]

NY_case
Out[9]:
Province_State UID code3 FIPS 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 12/5/20 12/6/20 12/7/20 12/8/20 12/9/20 12/10/20 12/11/20 12/12/20 12/13/20 12/14/20
36 New York 5378405916 53760 2405916.0 0 0 0 0 0 0 ... 695947 705448 713134 722494 733086 743290 753834 764779 774767 784137

1 rows × 332 columns

Step 2: Transpose to the time series format, and remove the first four rows since there are unrelated

In [10]:
#Transpose the confirmed dataset, and remove the first four rows
NY_case_cleaned = NY_case.T.iloc[4:]

#rename the column
NY_case_cleaned=NY_case_cleaned.rename(columns={36:"NY Confirmed"})

#Make a copy that will be used in EDA 
NY_case_cleaned1=NY_case_cleaned.copy()

#change to date format
NY_case_cleaned.index = pd.to_datetime(NY_case_cleaned.index)

Step 3: Remove zero confirmed cases

In [11]:
NY_case_cleaned2=NY_case_cleaned[NY_case_cleaned['NY Confirmed']!=0]

Show the cleaned dataset

In [12]:
NY_case_cleaned2.head()
Out[12]:
NY Confirmed
2020-03-03 1
2020-03-04 10
2020-03-05 21
2020-03-06 24
2020-03-07 76

2. Data visualization and EDA

In this part, I utilize data visualization and EDA technics to understand how the time series behaves before further exploring the model.

2.1 Original Data

2.1.1 Corona Virus trend in New York

According to the time series plot shown below, there is an overall increasing trend of COVID-19 in New York from March to middle December. In this plot, two time periods are also highlighted. The orange one indicates that the first movement-control-order from the NY government does not perform well; the number of confirmed cases increases rapidly. Therefore, the NY government extend the stay-home-order. The second green highlight is between the extended stay-home order and phase 1 reopening. Notice that as the time of movement restriction increases, the growth rate of confirmed cases becomes slower. It implies the effectiveness of quarantine as it effectively controls the social distancing among people. Furthermore, from the trend graph, it is also worthing noticing that the selected case data is not stationary. To further conduct a time-series model, a transformation will be needed to keep the data stationary.

In [13]:
#Source: https://www.investopedia.com/historical-timeline-of-covid-19-in-new-york-city-5071986
#firstcase = '3/1/20'
stayhomeorder = '2020-03-22'
stayhomeextend1="2020-04-16"
phase1reopen="2020-06-08"

fig = go.Figure()
fig.add_trace(go.Scatter(x=NY_case_cleaned2.index, y=NY_case_cleaned2['NY Confirmed'], name='Confirmed',
                         line=dict(color='blue', width=4)))

fig.add_vrect(x0=stayhomeorder, x1=stayhomeextend1,fillcolor="lightSalmon", opacity=0.5,
    layer="below", line_width=0,)

fig.add_vrect(x0=stayhomeextend1,x1=phase1reopen, fillcolor="lightgreen", opacity=0.5,
              layer="below", line_width=0,)

fig.update_layout(
    title='Corona Virus Trend in New York from 2020-03-03 to 2020-12-14 ',yaxis=dict(
        title='Number of Confirmed Cases'),autosize=False,width=700,height=500)
fig.show()

Test stationary
ADF test helps in estimating whether the time series is stationary. Since the p-value (0.8676) for the Dickey-Fuller(ADF) test is greater than the 5% significant level, the original data is not stationary. The residuals and seasonality for decomposed series have a recurring character, SARIMA model will be conducted.

In [14]:
print('Results of Dickey-Fuller Test without transformation:')
adftest = adfuller(NY_case_cleaned2['NY Confirmed'], autolag='AIC',)
adfoutput = pd.Series(adftest[0:4], index=['Test Statistic','Dickey-Fuller p-value','Number of Lags Used',
                                           'Number of Observations Used'])
print(adfoutput)
#The original data is non-stationaryb
Results of Dickey-Fuller Test without transformation:
Test Statistic                  -0.614717
Dickey-Fuller p-value            0.867622
Number of Lags Used             11.000000
Number of Observations Used    275.000000
dtype: float64

2.1.2 Rolling average of daily confirmed COVID-19 cases by window size of 30 days in New York

Next, I provide a rolling window calculation. The following plot shows the rolling average of daily confirmed COVID-19 cases by window size of 30 days. To be specific, the rolling average calculates a 30-day average for the most recent window period. And after that day, the window slides down again and calculates an average of 30 days. The rolling average time series has smoothed variation in daily case count. A small peak of the daily case increases between March and May and drop dramatically in June from the plot. Based on the previous time series plot, May and June are under the extended stay-home-order. On the other side, the number of confirmed cases shoots up after November. Some possible reasons could be the US election and climate change. Furthermore, transformations such as logarithms and differencing can help to stabilise the variance of a time series.

In [15]:
#Get daily increase data
NY_case_cleaned3=NY_case_cleaned2.copy()
NY_case_cleaned3["daily confirmed"]=NY_case_cleaned2["NY Confirmed"]-NY_case_cleaned2["NY Confirmed"].shift()
NY_case_cleaned3["daily confirmed"][0]=NY_case_cleaned2["NY Confirmed"][0]
In [16]:
plt.figure(figsize=(12,5))
plt.plot(NY_case_cleaned3["daily confirmed"],label='Daily')
plt.plot(NY_case_cleaned3["daily confirmed"].rolling(window=30,center=False).mean(),label='Monthly avg mean');
plt.plot(NY_case_cleaned3["daily confirmed"].rolling(window=30,center=False).std(),label='Monthly avg std');
plt.title("30-days average case time series in New York")
plt.legend();

2.1.3 ACF and PACF plots

Autocorrelation function (ACF) graph and partial autocorrelation (PACF) graph help find the initial number of ARIMA models. The ACF describes the degree to which the present value of the series is related to its past value. Time series can contain trends, seasonality, periodicity, and residuals. The ACF considers all of these components while looking for correlations. Meanwhile, the PACF finds the residuals (which remain after the previous lag effects have been explained and removed) are correlated with the next lag value.
According to the plot shown below, the ACF is geometric decay, while PACF significant till two lages. However, since the data is not significant, and I decide to use the ARIMA model, the information caught from both plots will not be used in further analysis.

In [17]:
fig, (ax1, ax2) = plt.subplots(1, 2,figsize=(16,6), dpi= 80)
plot_acf(NY_case_cleaned2["NY Confirmed"],ax=ax1,alpha=0.05) #with default value of lags
plot_pacf(NY_case_cleaned2["NY Confirmed"],ax=ax2,alpha=0.05)
plt.show()

2.2 Transformed Data

Log transformation and differences are the preferred approaches to stabilize the time series. So I will first take log transformation and then differencing once the dataset. After that, I will decomposite the time series and recheck the ACF and PACF plots.

2.2.1 Log transformation

In [18]:
NY_case_cleaned2["NY Confirmed"]=NY_case_cleaned2['NY Confirmed'].astype(str).astype(int)
NY_case_cleaned2["NY log Confirmed"]=np.log(NY_case_cleaned2["NY Confirmed"])
In [19]:
NY_case_cleaned2["NY log Confirmed"]=np.log(NY_case_cleaned2["NY Confirmed"])

2.2.2 Seasonal differencing

After transformation, the data is stationary.

In [20]:
first_diff = NY_case_cleaned2["NY log Confirmed"].diff()
first_diff = first_diff.dropna(inplace = False)
print("The p-value for ADF test is:", sm.tsa.stattools.adfuller(first_diff)[1])
The p-value for ADF test is: 6.133959707122101e-26

2.2.3 Time series decompositions

The three components are shown separately in following three panels. These components can be added together to reconstruct the data shown in the top panel. Notice that the seasonal component changes quickly over time, so that any 7 days have similar patterns. The remainder component(residual) shown in the bottom panel is what is left over when the seasonal and trend-cycle components have been subtracted from the data. The residuals shows high variability in the early stages of the series.

In [21]:
res = sm.tsa.seasonal_decompose(NY_case_cleaned2["NY log Confirmed"],extrapolate_trend='freq',model='additive')#model="multiplicative"
fig = res.plot()
fig.set_size_inches(18.5, 10.5, forward=True)
fig.show()

2.2.4 ACF and PACF graph

ACF plot helps determine the MA (q) model's order, while the PACF plot helps to set up the order of the AR(p) model. Based on the ACF and PACF plots of transformed data, we can see that from the PACF, we have a significant peak at lag 0. It is significant at every 3 lags. Looking at the ACF plot, we only see a significant until lag 16. In other words, PACF and ACF suggests an AR(16) process with seasonal length of 3.

In [22]:
fig, (ax1, ax2) = plt.subplots(1, 2,figsize=(16,6), dpi= 80)
plot_acf(first_diff,ax=ax1,alpha=0.05) #with default value of lags
plot_pacf(first_diff,ax=ax2,alpha=0.05)
plt.show()

3. Model selection and fitting to data

In this part, SARIMA model will be used.

3.1 Split the dataset

Before fitting the model, it is preferred to first split the dataset into train and test. In particular, 70% of the dataset will be the train set, while 30% of the dataset will be the test set.

In [23]:
train,test=train_test_split(NY_case_cleaned2,shuffle=False,random_state=75544)
train_portion = int(len(NY_case_cleaned2)*0.7)
train = NY_case_cleaned2[:train_portion]
test = NY_case_cleaned2[train_portion:]

3.2 SARIMA model

SARIMA model is ARIMA model with a seasonal component. The components of the model is shown as below:

  • p and seasonal P: indicating number of autoregressive terms (lags of the stationarized series)
  • d and seasonal D: indicate differencing that must be done to stationarize series
  • q and seasonal Q: indicate number of moving average terms (lags of the forecast errors)
  • s: indicates seasonal length in the data

In this study, ACF and PACF plots are the first approach to detect the primary case of the SARIMA model, which helps to narrow down to a list of potential parameters. On the other hand, grid search under AIC criteria is another approach to identify the worst and best case models.

3.2.1 Basic Case Model

Based on the rolling average plot, there is a seasonality effect on the dataset. Since PACF and ACF plots suggest AR(16) process with four-month seasonality length, the basic model will be p=3 and s=3.

In [24]:
#P=16, s=3
basic_model=sm.tsa.statespace.SARIMAX(train["NY log Confirmed"], order=(16,1,0), seasonal_order=(0,1,0,3)).fit(disp=-1)
print(basic_model.summary())
                                     SARIMAX Results                                      
==========================================================================================
Dep. Variable:                   NY log Confirmed   No. Observations:                  200
Model:             SARIMAX(16, 1, 0)x(0, 1, 0, 3)   Log Likelihood                 170.642
Date:                            Thu, 17 Dec 2020   AIC                           -307.285
Time:                                    21:35:35   BIC                           -251.557
Sample:                                03-03-2020   HQIC                          -284.723
                                     - 09-18-2020                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.2332      0.067     -3.461      0.001      -0.365      -0.101
ar.L2          0.0907      0.070      1.300      0.193      -0.046       0.227
ar.L3          0.6051      0.046     13.047      0.000       0.514       0.696
ar.L4         -0.1549      0.129     -1.197      0.231      -0.409       0.099
ar.L5         -0.2472      0.092     -2.702      0.007      -0.427      -0.068
ar.L6          0.1093      0.066      1.646      0.100      -0.021       0.239
ar.L7          0.4980      0.095      5.221      0.000       0.311       0.685
ar.L8         -0.2383      0.079     -3.013      0.003      -0.393      -0.083
ar.L9         -0.4026      0.078     -5.142      0.000      -0.556      -0.249
ar.L10        -0.1966      0.076     -2.583      0.010      -0.346      -0.047
ar.L11         0.2768      0.093      2.983      0.003       0.095       0.459
ar.L12         0.0081      0.082      0.099      0.921      -0.152       0.168
ar.L13        -0.1304      0.069     -1.902      0.057      -0.265       0.004
ar.L14         0.1233      0.056      2.201      0.028       0.013       0.233
ar.L15         0.4096      0.051      8.035      0.000       0.310       0.510
ar.L16         0.3241      0.053      6.084      0.000       0.220       0.429
sigma2         0.0097      0.000     20.871      0.000       0.009       0.011
===================================================================================
Ljung-Box (Q):                       63.60   Jarque-Bera (JB):              1723.00
Prob(Q):                              0.01   Prob(JB):                         0.00
Heteroskedasticity (H):               0.00   Skew:                            -1.00
Prob(H) (two-sided):                  0.00   Kurtosis:                        17.39
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

3.2.2 Grid Search SARIMA

Using the gird search, we can fit the model directly and choose the worst and best case model by trying different parameters based on AIC and BIC criteria. p and P are in the range of 3, q and Q are in the range of 3, seasonal length equals to 3, differencing term equals to 1.

In [25]:
p = range(0, 3, 1)
P = range(0, 3, 1) #indicate number of autoregressive terms (lags of the stationarized series)
d = 1 #difference
D = 1 #indicate differencing that must be done to stationarize series
q = range(0, 3, 1) #indicate number of moving average terms (lags of the forecast errors)
Q = range(0, 3, 1)
s = 3 #seasonal length in the data
parameters = product(p,q,P,Q)
parameters_list = list(parameters)
#print(len(parameters_list))
#Return dataframe with parameters, and corresponding AIC
In [26]:
#Return dataframe with parameters, and corresponding AIC
def grid_search_SARIMA(parameters_list,d,D,s, data):
    temp = []
    for param in parameters_list:
        try: 
            model = SARIMAX(data, order=(param[0], d, param[1]), 
                            seasonal_order=(param[2], D, param[3], 
                                            s)).fit(disp=-1)
        except:
            continue
        aic = model.aic
        temp.append([param, aic])
    results = pd.DataFrame(temp)
    results.columns = ['(p,q)x(P,Q)', 'AIC']
    #Sort in ascending order, lower AIC is better
    results = results.sort_values(by='AIC').reset_index(drop=True)
    return results

#We could also use auto_arima function
#auto_arima=pm.auto_arima(train["NY log Confirmed"],start_p=1,start_q=1,
#                        test="adf",max_p=3,max_q=3,m=3,d=1,seasonal=True,
#                        start_P=0,start_Q=0,max_P=16,max_Q=3,D=1,trace=True,error_Action='ignore')
In [27]:
results = grid_search_SARIMA(parameters_list,1,1,3, train["NY log Confirmed"])
results
Out[27]:
(p,q)x(P,Q) AIC
0 (2, 2, 2, 2) -310.444988
1 (0, 2, 2, 2) -303.568706
2 (2, 2, 2, 0) -303.417169
3 (1, 2, 2, 2) -302.488221
4 (2, 0, 2, 2) -299.996912
... ... ...
76 (0, 2, 0, 0) -255.509387
77 (1, 2, 0, 0) -251.104041
78 (1, 0, 0, 0) -249.023823
79 (0, 1, 0, 0) -247.471421
80 (0, 0, 0, 0) -246.001058

81 rows × 2 columns

The model with the largest AIC value will be the worse case model, and vice versa.

3.2.2.1 Worst Case Model

The worst case spread is the SARIMA(0,1,0)(0,1,0) model with seasonal length of 3 as it yields the largest AIC value (-246.001).

In [28]:
worst_model=sm.tsa.statespace.SARIMAX(train["NY log Confirmed"], order=(0,1,0), seasonal_order=(0,1,0,3)).fit(disp=-1)
print(worst_model.summary())
                                     SARIMAX Results                                     
=========================================================================================
Dep. Variable:                  NY log Confirmed   No. Observations:                  200
Model:             SARIMAX(0, 1, 0)x(0, 1, 0, 3)   Log Likelihood                 124.001
Date:                           Thu, 17 Dec 2020   AIC                           -246.001
Time:                                   21:35:56   BIC                           -242.723
Sample:                               03-03-2020   HQIC                          -244.674
                                    - 09-18-2020                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
sigma2         0.0165      0.000     52.678      0.000       0.016       0.017
===================================================================================
Ljung-Box (Q):                       53.26   Jarque-Bera (JB):             24415.72
Prob(Q):                              0.08   Prob(JB):                         0.00
Heteroskedasticity (H):               0.00   Skew:                            -6.31
Prob(H) (two-sided):                  0.00   Kurtosis:                        56.20
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

3.2.2.2 Best Case Model

Similarly, the best case spread is the SARIMA(2,1,2)(2,1,2) model with seasonal length of 3 as it yields the smallest AIC value (-310.445).

In [29]:
best_model=sm.tsa.statespace.SARIMAX(train["NY log Confirmed"], 
                                     order=(2,1,2), 
                                     seasonal_order=(2, 1, 2, 3)).fit(disp=-1)
print(best_model.summary())
                                     SARIMAX Results                                     
=========================================================================================
Dep. Variable:                  NY log Confirmed   No. Observations:                  200
Model:             SARIMAX(2, 1, 2)x(2, 1, 2, 3)   Log Likelihood                 164.222
Date:                           Thu, 17 Dec 2020   AIC                           -310.445
Time:                                   21:35:57   BIC                           -280.942
Sample:                               03-03-2020   HQIC                          -298.501
                                    - 09-18-2020                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -1.7184      0.107    -16.014      0.000      -1.929      -1.508
ar.L2         -0.9646      0.130     -7.402      0.000      -1.220      -0.709
ma.L1          1.5426      0.030     52.083      0.000       1.485       1.601
ma.L2          0.9242      0.025     36.251      0.000       0.874       0.974
ar.S.L3        0.6786      0.040     17.067      0.000       0.601       0.757
ar.S.L6       -0.8554      0.029    -29.867      0.000      -0.912      -0.799
ma.S.L3        0.0842      0.063      1.342      0.180      -0.039       0.207
ma.S.L6        0.9293      0.362      2.564      0.010       0.219       1.640
sigma2         0.0104      0.002      4.216      0.000       0.006       0.015
===================================================================================
Ljung-Box (Q):                       58.30   Jarque-Bera (JB):              4283.74
Prob(Q):                              0.03   Prob(JB):                         0.00
Heteroskedasticity (H):               0.00   Skew:                            -1.44
Prob(H) (two-sided):                  0.00   Kurtosis:                        25.72
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [30]:
print(best_model.plot_diagnostics(figsize=(10,8)))
Figure(720x576)
  • The standardized residual plot: The residuals over time don’t display any obvious seasonality
  • Histogram with estimated density plot: The KDE line follows normal distribution, which indicates the residuals are approximately normally distributed.
  • The Q-Q-plot: Shows that the ordered distribution of residuals (blue dots) follows the linear trend of the samples. However, several outliers exists.
  • The Correlogram plot: Residuals have low correlation with their own lagging version

3.3 Performance

3.3.1 Using best model predict the test set

In [31]:
#predict the model the test set
best_model_predict = best_model.predict(start= '2020-09-19' ,
                                    end = '2020-12-14',
                                    dynamic= True)
In [32]:
best_model_predict=np.exp(best_model_predict)
best_model_predict=best_model_predict.to_frame()
best_model_predict=best_model_predict.rename(columns={0: 'Forecast'})
best_model_predict.head()
Out[32]:
Forecast
2020-09-19 446830.012132
2020-09-20 447134.112833
2020-09-21 449600.491795
2020-09-22 450020.174793
2020-09-23 451116.115163

3.3.2 Forecast the next 30 days

In [33]:
#predict the model of the test set
best_model_forecast = best_model.predict(start='2020-12-15',
                                    end = date.today()+timedelta(days=30),
                                    dynamic= True)
In [34]:
best_model_forecast=np.exp(best_model_forecast)
best_model_forecast=best_model_forecast.to_frame()
best_model_forecast=best_model_forecast.rename(columns={0: 'Forecast'})
best_model_forecast.head()
Out[34]:
Forecast
2020-12-15 720939.501167
2020-12-16 723616.042026
2020-12-17 730798.844364
2020-12-18 733395.031411
2020-12-19 736085.211300

3.3.3 Illustration

In [35]:
mse = ((best_model_predict['Forecast'] - test['NY Confirmed']) ** 2).mean()
print('The Mean Squared Error of forecasts is {}'.format(round(mse, 2)))

# mean absolute value on test b
mae = (best_model_predict['Forecast'] - test['NY Confirmed']).mean()
print('The Mean Absolute Error of forecasts is {}'.format(round(mae, 2)))
The Mean Squared Error of forecasts is 1629140421.28
The Mean Absolute Error of forecasts is 26254.51
In [36]:
Trumpdiagnosed=datetime.strptime('2020-10-02', '%Y-%m-%d')
plt.axvline(x=Trumpdiagnosed,color="purple")

plt.plot(train["NY Confirmed"], color='black',label='Original')
plt.plot(best_model_predict, color='red', label='Predict')
plt.plot(test["NY Confirmed"], color='blue',label='Test')
plt.plot(best_model_forecast, color='green', label='Forecast')
plt.legend(loc='best')
plt.title('Forecast of Covid-19 Confirmed Cases in New York in the upcoming 30 days',y=1.05)
plt.xlabel('Date')
plt.xticks(rotation=90)
plt.ylabel('NY total confirmed cases')
plt.show()

The forecast plot shown above illustrates both predicted data on the test set and the forecast data on the 30 days after 2020-12-14. The purple line indicates the date Trump was diagnosed with the coronavirus. The overall trend of the expected data from the best model is similar to the original data but does not follow the trend perfectly. One possible reason might be because I use the log-transformed data instead of differenced data.

3.4 Use Prophet package to forecast the COVID-19 confirmed case

Prohphet is an open source software developed by Facebook Core Data Science Team. It is a process of forecasting time series data based on additive models with non-linear. It is best suited to time series with strong seasonal effects and multiple seasonal historical data and could well handle outliers.

In [37]:
NY_case_cleaned4=NY_case_cleaned2.reset_index()
NY_case_cleaned4=NY_case_cleaned4.drop(columns=["NY log Confirmed"])
NY_case_cleaned4=NY_case_cleaned4.rename(columns={"index": "ds", "NY Confirmed": "y"})
NY_case_cleaned4.head()
Out[37]:
ds y
0 2020-03-03 1
1 2020-03-04 10
2 2020-03-05 21
3 2020-03-06 24
4 2020-03-07 76
In [38]:
#fit the prophet
n = Prophet()
n.fit(NY_case_cleaned4)
# show the future
future = n.make_future_dataframe(periods=30)
future.tail()
# forecast the future
forecast = n.predict(future)
forecast[['ds', 'yhat']].tail()
# Plot the forecasting result
fig1 = n.plot(forecast) #With change points
a = add_changepoints_to_plot(fig1.gca(), n, forecast)
INFO:fbprophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.

The above plot shows the actual and predicted value of confirmed cases. The red lines indicate the positions where the tilt has changed. Notice that the overall trend generated by Prophet is similar to the result generated by SARIMA. Also, the redlines are somehow related to the policy of stay-home-order. In particular, similar to the Corona Virus trend in New York discussed in part 2, middle April and the middle of June are the critical periods that start with the movement restriction policy. The result shows that the pandemic has improved. However, with the relaxation of policies, the number of confirmed cases in New York has grown again. The growth rate is getting even faster.

4. Impact of Movement Restriction on COVID-19 in New York

In this part, two datasets are selected which are Guidotti and Ardia (2020) and San Francisco. The analysis will focus on how and when the interneal movement restriction policy affect the number of daily confirmed cases.

4.1 Load datasets and perform data cleaning

4.1.1 COVID-19 dataset from Guidotti and Ardia (2020)

In [39]:
#Collect the data
policy, src = covid19("USA", level = 2, start = date(2020,3,3),end=date(2020,12,14),verbose=False)
In [40]:
#Locate the location to New York
policy_NY=policy[policy["administrative_area_level_2"]=="New York"].reset_index()
policy_NY=policy_NY.drop(columns=["index","id","latitude","longitude",
                                  "administrative_area_level","administrative_area_level_1",
                                  "administrative_area_level_2","administrative_area_level_3",
                                 "key_google_mobility","key_numeric","key","key_alpha_2",
                                 "key_apple_mobility","iso_alpha_2","iso_alpha_3","iso_numeric","currency"])
In [41]:
#Create new columns "daily confirmed","daily recovered", and "daily deaths"
policy_NY["daily confirmed"]=policy_NY["confirmed"]-policy_NY["confirmed"].shift()
policy_NY["daily recovered"]=policy_NY["recovered"]-policy_NY["recovered"].shift()
policy_NY["daily deaths"]=policy_NY["deaths"]-policy_NY["deaths"].shift()
policy_NY["daily confirmed"][0]=policy_NY["confirmed"][0]
policy_NY["daily recovered"][0]=policy_NY["recovered"][0]
policy_NY["daily deaths"][0]=policy_NY["deaths"][0]
policy_NY.head()
Out[41]:
date tests confirmed recovered deaths hosp vent icu population school_closing ... stay_home_restrictions internal_movement_restrictions international_movement_restrictions information_campaigns testing_policy contact_tracing stringency_index daily confirmed daily recovered daily deaths
0 2020-03-03 1.0 1.0 NaN NaN NaN NaN NaN 23628065.0 0.0 ... 0.0 0.0 3.0 2.0 1.0 1.0 19.44 1.0 NaN NaN
1 2020-03-04 10.0 1.0 NaN NaN NaN NaN NaN 23628065.0 0.0 ... 0.0 0.0 3.0 2.0 1.0 1.0 19.44 0.0 NaN NaN
2 2020-03-05 30.0 3.0 NaN NaN NaN NaN NaN 23628065.0 0.0 ... 0.0 0.0 3.0 2.0 1.0 1.0 19.44 2.0 NaN NaN
3 2020-03-06 122.0 25.0 NaN NaN NaN NaN NaN 23628065.0 0.0 ... 0.0 0.0 3.0 2.0 1.0 1.0 19.44 22.0 NaN NaN
4 2020-03-07 188.0 36.0 NaN NaN NaN NaN NaN 23628065.0 0.0 ... 0.0 0.0 3.0 2.0 1.0 1.0 19.44 11.0 NaN NaN

5 rows × 24 columns

4.1.2 San Francisco’s COVID-19 cases summarized by age group dataset

In [42]:
#Load the dataset
age_group=pd.read_csv("COVID-19_Cases_Summarized_by_Age_Group.csv")

#Group the data by age group
age=age_group[["Age Group","New Confirmed Cases"]] 
age= age.groupby(["Age Group"]).sum()
age=age.reset_index()
In [43]:
age_group.head()
Out[43]:
Specimen Collection Date Age Group New Confirmed Cases Cumulative Confirmed Cases Last Updated at
0 2020/03/13 25-29 3 7 12/14/2020 02:15:02 PM
1 2020/03/14 25-29 2 9 12/14/2020 02:15:02 PM
2 2020/03/15 25-29 1 10 12/14/2020 02:15:02 PM
3 2020/03/16 25-29 1 11 12/14/2020 02:15:02 PM
4 2020/03/17 25-29 0 11 12/14/2020 02:15:02 PM

4.2 Exploratory Data Analysis

4.2.1 Correlation of all factors for New York covid 19 cases

This part will explore the factors that have correlation with the number of daily confirmed cases, and to what extent are these factors related to the number of confirmed cases per day.

In [44]:
nysuit=policy_NY[["daily confirmed","daily recovered","daily deaths","hosp","icu","vent",
                  "school_closing","workplace_closing","cancel_events","gatherings_restrictions",
                 "stay_home_restrictions","internal_movement_restrictions","testing_policy","contact_tracing",
                 "stringency_index"]]
corr_matrix=nysuit.corr()
In [45]:
#Correlation plot
plt.figure(figsize=(9, 10))
mask = np.zeros_like(corr_matrix)
mask[np.triu_indices_from(mask)] = True

with sns.axes_style("white"):
    sns.heatmap(corr_matrix, mask=mask, square=True,annot=True)
    
plt.yticks(rotation=0)
plt.show()

The above heatmap shows the strength of the linear relationship of New York daily covid-19 cases.

  • Notice that the number of hospitalized patients and the number of patients requiring invasive ventilation are the two variables highly correlated with the number of hospitalized patients in ICUs on the date.
  • Focusing on the number of daily confirmed cases, It is moderately correlated with the number of patients who are hospitalization or in ICU. Among all other factors, the daily confirmed cases are small positively correlated with school closing, while is relatively large positively correlated to workplace closing. Also, it is negatively correlated with internal movement restrictions, but the association is small.

4.2.2 Distribution of daily confirmed case age group in US

In [46]:
fig = px.pie(age, values='New Confirmed Cases', names='Age Group', 
             title='COVID-19 Patients age wise in US from 2020/03/13 to 2020/12/14 ',
             color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()

The above pie chart indicates that:

  • The top three age groups with the highest percentage of daily case confirmed rate are 30-19, 40-49, and 25-29, respectively.
  • Exclude unknown, the least three age groups infected the COVID-19 are 11-13,0-4, and 14-17.
  • Middle-aged people are the majority of people infected with the disease. At the same time, the young and the elder are the minority people who confirmed the diseases. Indeed, in the previous parts, I have found that social-distancing is a crucial factor that could affect the pandemic situation. Since the middle-aged make up the bulk of society's workforce, I would like to study the relationship between daily confirmed cases and internal movement restrictions.

4.2.3 Workplace closing suitation per day

In [47]:
policy_NY2=policy_NY.copy()
policy_NY2['date'] = pd.to_datetime(policy_NY2['date'])
policy_NY2.set_index('date', inplace=True)
fig = go.Figure()
fig.add_trace(go.Scatter(x=policy_NY2.index, y=policy_NY2['workplace_closing'], name='workplace closing',
                         line=dict(color='firebrick', width=4)))
fig.update_layout(
    title='Workplace closing suitation Per day',yaxis=dict(title='Workplace closing'),
    autosize=False,width=700,height=500)
fig.show()

The above dynamic plot indicates two important periods: 3/12 to 6/8, and 10/8 to 12/14 because both periods are under the "3" level restrictions: require closing (or work from home).

4.2.4 Daily Covid-trend in New York with two hightlighted periods

According to the time series plot shown below, there are two peaked periods: March to June and October to December. Interestingly, both peaked periods are within the highlighted periods of the third level workplace closing policy. It implies that the NY government will require to close the workplace and work from home once there is an increasing trend. After the workplace closes for a while, the severity of the epidemic abates, and the NY government will reopen workplaces. However, it is hard to detect the relationship's direction as there is a peak within each third level workplace closing restriction.

In [48]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=policy_NY2.index, y=policy_NY2['daily confirmed'], name='Daily Confirmed',
                         line=dict(color='blue', width=4)))
fig.add_trace(go.Scatter(x=policy_NY2.index, y=policy_NY2['daily deaths'], name='Daily Deaths',
                         line=dict(color='firebrick', width=4)))
fig.add_trace(go.Scatter(x=policy_NY2.index, y=policy_NY2['daily recovered'], name='Daily Recovered',
                         line=dict(color='green', width=4)))

fig.add_vrect(x0="2020-03-22", x1="2020-06-07",fillcolor="lightgoldenrodyellow", opacity=0.5,
    layer="below", line_width=0,)

fig.add_vrect(x0="2020-10-08", x1="2020-12-14",fillcolor="lightgoldenrodyellow", opacity=0.5,
    layer="below", line_width=0,)

fig.update_layout(title='Corona Virus Trend in New York',yaxis=dict(title='Number of Covid 19 Cases Per Day'),
    autosize=False,width=700,height=500)
fig.show()

4.3 Linear Regression

Dependet variable: Daily confirmed cases
Independent variable: Internal movement restrictions:

* 0: No measure
* 1: Recommended closing (or significantly reduce volume/route/means of transport)
* 2: Require closing (or prohibit most people from using it))

4.3.1 Data preprocessing

In [49]:
#remove 0 case
df=policy_NY2[policy_NY2['daily confirmed']!=0].reset_index()
df1=df.copy()
df=df.drop(columns=["confirmed","recovered","deaths","daily recovered","daily deaths","date","population"])

df_clean=pd.DataFrame()
df_clean['daily confirmed']=df['daily confirmed']
df_clean["internal_movement_restrictions"]= df["internal_movement_restrictions"]
#df_clean["workplace_closing"]= df["workplace_closing"]

#remove nan value
df_clean=df_clean.dropna()

df_clean.describe()
Out[49]:
daily confirmed internal_movement_restrictions
count 280.000000 280.000000
mean 2580.228571 1.553571
std 2851.961559 0.577960
min 1.000000 0.000000
25% 701.500000 1.000000
50% 1119.500000 2.000000
75% 3515.000000 2.000000
max 11571.000000 2.000000
In [50]:
#Select the variable "workplace_closing"
df_clean["internal_movement_restrictions"]= df_clean["internal_movement_restrictions"].apply(str)

# X1 is independent variable
Y = df_clean['daily confirmed']
# Y is dependent variable that will be predicted based on X1
X = df_clean["internal_movement_restrictions"]


#level the X variable
X_dummy=pd.get_dummies(X)
In [51]:
#The distirbution of Y is skewed, the Box Cox transformation is required
print(plt.hist(Y))
(array([142.,  49.,  17.,   9.,  18.,  12.,  10.,   9.,   5.,   9.]), array([1.0000e+00, 1.1580e+03, 2.3150e+03, 3.4720e+03, 4.6290e+03,
       5.7860e+03, 6.9430e+03, 8.1000e+03, 9.2570e+03, 1.0414e+04,
       1.1571e+04]), <a list of 10 Patch objects>)
In [52]:
#Perform Boxcox transformation
Y_trans,fitted_lambda = scs.boxcox(Y)
fitted_lambda
Out[52]:
0.1875821295912475

4.3.2 Fit the Model

In [53]:
X_dummy=sm.add_constant(X_dummy)        #to add constant value in the model
model= sm.OLS(Y_trans,X_dummy).fit()         #fitting the model
predictions= model.summary()      #summary of the model
predictions
Out[53]:
OLS Regression Results
Dep. Variable: y R-squared: 0.334
Model: OLS Adj. R-squared: 0.329
Method: Least Squares F-statistic: 69.51
Date: Thu, 17 Dec 2020 Prob (F-statistic): 3.45e-25
Time: 21:36:05 Log-Likelihood: -773.52
No. Observations: 280 AIC: 1553.
Df Residuals: 277 BIC: 1564.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 9.6029 0.303 31.644 0.000 9.005 10.200
0.0 -4.7195 0.843 -5.598 0.000 -6.379 -3.060
1.0 8.6260 0.407 21.196 0.000 7.825 9.427
2.0 5.6964 0.370 15.415 0.000 4.969 6.424
Omnibus: 21.598 Durbin-Watson: 0.093
Prob(Omnibus): 0.000 Jarque-Bera (JB): 22.075
Skew: 0.641 Prob(JB): 1.61e-05
Kurtosis: 2.504 Cond. No. 2.21e+16


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.64e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

The summary result of the model is shown above. Since the p-value for F statistics is less than a 5% significant level, we can conclude that at least one level has a linear relationship with the number of daily confirmed cases. Also, the coefficients for level 1 is smaller than level 2, which means requiring closing will lead to fewer confirmed cases, implying a positive impact of movement control.

4.4 ANOVA

4.4.1 Data Preprocessing

In [54]:
df2=pd.DataFrame(X)
df2["daily confirmed"]=Y_trans
df2.groupby(df2["internal_movement_restrictions"])
Out[54]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fdd25330310>

4.4.2 Perform one-way ANOVA

In [55]:
scs.f_oneway(df2["daily confirmed"][df2["internal_movement_restrictions"]== '0.0'],
               df2["daily confirmed"][df2["internal_movement_restrictions"]  == '1.0'],
               df2["daily confirmed"][df2["internal_movement_restrictions"] == '2.0'])
Out[55]:
F_onewayResult(statistic=69.5059664790454, pvalue=3.4458498700867286e-25)

$H_0: \mu_{level 0} = \mu_{level 1} = \mu_{level 3}$
$H_a$ : Means are not all equal
Since the p value is less then 0.05, we reject the null hypothesis, and can conclude that the mean of daily confirmed cases is significantly different for at least one of the workplace closing levels, which is the same result generated from the linear regression model.

4.4.3 Perform a post-hoc test

In [56]:
result = mc.MultiComparison(df2["daily confirmed"],df2["internal_movement_restrictions"])
post_hoc_res = result.tukeyhsd()
post_hoc_res.summary()
Out[56]:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
0.0 1.0 13.3455 0.001 10.5726 16.1184 True
0.0 2.0 10.4159 0.001 7.7019 13.13 True
1.0 2.0 -2.9296 0.001 -4.0743 -1.7849 True

Through the post-hoc test, one can conclude that:

  • The p values of turkey pairwise comparisons are all less than 5% significant level
  • Reject the null hypothesis that the means of daily confirmed for each pair of restriction group are the same
  • All groups (restriction level) differ in their daily confirmed cases.

5. Insights about policy and guidance to tackle the outbreak based on model findings

In this part, insights on coronavirus will be delivered based on the finding in two analyses. This section will also discuss how healthcare professionals, industry, and governments can best use the insights to assist in the fight against the COVID-19 pandemic.

5.1 Time Serie Analysis on using JHU CSSE COVID-19 Daily Report Dataset

  • Time-series analysis is conducted using the Johns Hopkins University Center for Systems and Engineering (JHU CSSE) COVID-19 dataset in the first study. After data cleaning, an exploratory data analysis is performed to check the trend of the daily total confirmed cases in NY between 2020-03-03 to 2020-12-14.
  • Next, a statistical model is an important technology for real-time analysis of infectious disease data. A SARIMA model is used since it effectively considers serial linear correlation among observation with seasonal factors and works for a non-stationary dataset. After using grid search methods, the optimal parameters of the model are chosen to predict the model. This analysis also forecasts the one month after 2020-12-14 and finds a relatively decreasing tendency. The forecasting result is compared with Prophet, open-source software released by Facebook's Core Data Science team.

5.1.1 Insights from EDA

The time series plot of cumulative confirmed case shown below, there is an overall increasing trend of COVID-19 in New York from March to middle December. The yellow area highlights the period of movement restriction order, which include school closure, workplace closure, and internal movement control. As the stay-at-home order extends, the growth rate of confirmed cases becomes slower, which implies the effectiveness of quarantine as it effectively controls the social distancing among people.

In [57]:
stayhomeorder = '2020-03-22'
phase1reopen="2020-06-08"
fig = go.Figure()
fig.add_trace(go.Scatter(x=NY_case_cleaned2.index, y=NY_case_cleaned2['NY Confirmed'], name='Confirmed',
                         line=dict(color='orange', width=4)))
fig.add_vrect(x0=stayhomeorder,x1=phase1reopen, fillcolor="lightyellow", opacity=0.5,
              layer="below", line_width=0,)
fig.update_layout(
    title='Corona Virus Trend in New York from 2020-03-03 to 2020-12-14 ',yaxis=dict(
        title='Number of Confirmed Cases'),autosize=False,width=600,height=400)
fig.show()

5.1.2 Insights from SARIMA model

  • From the first forecast plot based on the SARIMA model, the red and green curve's combination shows the predicted trend from September to next year January. The red predicted trend is increasing without following the original test trend perfectly. However, the green forecasting result indicates that the daily confirmed tendency is not as serious as the original dataset even though there is an increasing trend of daily confirmed in New York next month.

  • The second forecast plot based on Prohpht illustrates the same situation. However, the response variable is the number of daily confirmed cases in New York. The daily confirmed case will keep increase, but the increase rate is relatively slower.

In [58]:
plt.plot(train["NY Confirmed"], color='black',label='Original')
plt.plot(best_model_predict, color='red', label='Predict')
plt.plot(test["NY Confirmed"], color='blue',label='Test')
plt.plot(best_model_forecast, color='green', label='Forecast')
plt.legend(loc='best')
plt.title('Forecast of Covid-19 Confirmed Cases in New York in the upcoming 30 days',y=1.05)
plt.xlabel('Date')
plt.xticks(rotation=90)
plt.ylabel('NY total confirmed cases')
plt.show()
In [59]:
NY_case_cleaned5=NY_case_cleaned3.reset_index()
NY_case_cleaned5=NY_case_cleaned5.drop(columns=["NY Confirmed"])
NY_case_cleaned5=NY_case_cleaned5.rename(columns={"index": "ds", "daily confirmed": "y"})
n = Prophet()
n.fit(NY_case_cleaned5)
# show the future
future = n.make_future_dataframe(periods=30)
future.tail()
# forecast the future
forecast = n.predict(future)
forecast[['ds', 'yhat']].tail()
# Plot the forecasting result
fig1 = n.plot(forecast, figsize=(8, 4)) #With change points
INFO:fbprophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.

5.2 Study of the Impact of Movement Restriction on COVID-19 using Guidotti and Ardia (2020) and San Francisco Datasets

  • Two data are selected in this analysis. The first one is from COVID-19 Data Hub which contains worldwide fine-grained case data and is merged with exogenous variables to help a better understanding of COVID-19. The second one is the San Francisco's COVID-19 cases summarized by age group, which includes positive confirmed cases by age group.
  • In this analysis, I check the correlation of daily confirmed cases in New York with other variables in the covid data hub dataset, and find that there is no factor highly correlated with the daily confirmed, and among all policy factors, workplace closing has the highest correlation with the daily confirmed cases. Meanwhile, the internal movement restrictions only has small negative association with the daily confirmed case.
  • Move on, I detect the distribution of distribution of daily confirmed case age group in US, and investigate that middle-age people are the majority of people infected with the disease. Since those group of people occur a main portion of labour force, this study will analyze the relationship between movement control and daily confirmed cases.
  • In this analysis, both linear regression and ANOVA are used to detect the impact of movement restrictions on the spread of covid-19 disease. Furthermore, a post hoc test is utilized to detect which means are different from one another.

5.2.1 Insights from EDA

By examining the daily confirmed curve in New York above, one could extract some exciting behaviors.

  • For instance, the blue curve shows that New York suffering many confirmed between March and May. Therefore, New York managed to stop spreading the virus and reached the goal by closing the workplaces until June. The trend of daily deaths in that period is similar to every day confirmed case. Workplace closure leading to a decrease in daily deaths. From June to September, the everyday confirmed case is lower than 2000; yet, the red curve shows that daily deaths remain constant. It is also interesting to notice that the number of daily confirmed cases still increases under a workplace closure order, similar to the trend between March and April.
  • According to the previous forecast on the confirmed case in the upcoming 30 days, there is an increasing trend in NY. The workplace closure order might have to extend to restrain confirmed case growth.
In [60]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=policy_NY2.index, y=policy_NY2['daily confirmed'], name='Daily Confirmed',
                         line=dict(color='blue', width=4)))
fig.add_trace(go.Scatter(x=policy_NY2.index, y=policy_NY2['daily deaths'], name='Daily Deaths',
                         line=dict(color='firebrick', width=4)))

fig.add_vrect(x0="2020-03-22", x1="2020-06-07",fillcolor="lightgoldenrodyellow", opacity=0.5,
    layer="below", line_width=0,)

fig.add_vrect(x0="2020-10-08", x1="2020-12-14",fillcolor="lightgoldenrodyellow", opacity=0.5,
    layer="below", line_width=0,)

fig.update_layout(title='Corona Virus Trend in New York',yaxis=dict(title='Number of Covid 19 confirmed Cases Per Day'),
    autosize=False,width=600,height=400)
fig.show()

5.2.2 Insights from Linear Regression and ANOVA

Both results of linear regression and ANOVA are significant. From linear regression model, one can conclude that at least one level has a linear relationship with the number of daily confirmed cases. Notice that the coefficients for level 1 is smaller than level 2, which means requiring closing will lead to fewer confirmed cases, which implies a positive impact of movement control.
In addition, the result of a post hoc test is shown below. Through the post-hoc test, one can conclude that all restriction groups differ in their mean of daily confirmed cases, implying internal movement restrictions effectiveness.

In [61]:
X_dummy=sm.add_constant(X_dummy)      
model= sm.OLS(Y_trans,X_dummy).fit()     
model.summary()      
Out[61]:
OLS Regression Results
Dep. Variable: y R-squared: 0.334
Model: OLS Adj. R-squared: 0.329
Method: Least Squares F-statistic: 69.51
Date: Thu, 17 Dec 2020 Prob (F-statistic): 3.45e-25
Time: 21:36:07 Log-Likelihood: -773.52
No. Observations: 280 AIC: 1553.
Df Residuals: 277 BIC: 1564.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 9.6029 0.303 31.644 0.000 9.005 10.200
0.0 -4.7195 0.843 -5.598 0.000 -6.379 -3.060
1.0 8.6260 0.407 21.196 0.000 7.825 9.427
2.0 5.6964 0.370 15.415 0.000 4.969 6.424
Omnibus: 21.598 Durbin-Watson: 0.093
Prob(Omnibus): 0.000 Jarque-Bera (JB): 22.075
Skew: 0.641 Prob(JB): 1.61e-05
Kurtosis: 2.504 Cond. No. 2.21e+16


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.64e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [62]:
result = mc.MultiComparison(df2["daily confirmed"],df2["internal_movement_restrictions"])
post_hoc_res = result.tukeyhsd()
post_hoc_res.summary()
Out[62]:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
0.0 1.0 13.3455 0.001 10.5726 16.1184 True
0.0 2.0 10.4159 0.001 7.7019 13.13 True
1.0 2.0 -2.9296 0.001 -4.0743 -1.7849 True

6. Conclusion

This project conducts two analyses that forecast New York COVID-19 confirmed cases in the policy perspective, aiming to help policymakers plan ahead and take action. In this study, main findings are:

  • There is an increasing trend of daily confirmed in New York next month; yet, the growing tendency is not as severe as the current.
  • People aged between 25-49 are the majority group who get infected with COVID-19 in New York.
  • Internal movement restriction does have a positive impact on reducing the spread of the pandemic. As an increasing trend forecasted previously, the movement control is required to extend to reduce the tendency of COVID-19 spread in New York.

7. Reference

Datasets:

Other: